Multimodal Compact Bilinear Pooling for Visual Question Answering and Visual Grounding
Modeling textual or visual information with vector representations trained
from large language or visual datasets has been successfully explored in recent
years. However, tasks such as visual question answering require combining these
vector representations with each other. Approaches to multimodal pooling
include element-wise product or sum, as well as concatenation of the visual and
textual representations. We hypothesize that these methods are not as
expressive as an outer product of the visual and textual vectors. As the outer
product is typically infeasible due to its high dimensionality, we instead
propose utilizing Multimodal Compact Bilinear pooling (MCB) to efficiently and
expressively combine multimodal features. We extensively evaluate MCB on the
visual question answering and grounding tasks. We consistently show the benefit
of MCB over ablations without MCB. For visual question answering, we present an
architecture which uses MCB twice, once for predicting attention over spatial
features and again to combine the attended representation with the question
representation. This model outperforms the state-of-the-art on the Visual7W
dataset and the VQA challenge. Accepted to EMNLP 2016.
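To make the fusion step concrete, here is a minimal sketch of compact bilinear pooling in the spirit of MCB: each modality's vector is projected with Count Sketch, and the two sketches are combined by FFT-based circular convolution, which approximates a sketch of their outer product. The output dimension, seeding, and function names below are illustrative assumptions, not the authors' reference implementation.

```python
import numpy as np

def count_sketch(v, h, s, d):
    # Project v into d dimensions: bucket indices h and signs s are
    # drawn once per modality and then held fixed.
    y = np.zeros(d)
    np.add.at(y, h, s * v)
    return y

def mcb(v_img, v_txt, d=16000, seed=0):
    # Sketch each vector, then multiply in the frequency domain:
    # the element-wise product of FFTs equals circular convolution of
    # the sketches, approximating a Count Sketch of the outer product.
    rng = np.random.default_rng(seed)
    h1 = rng.integers(0, d, v_img.size); s1 = rng.choice([-1, 1], v_img.size)
    h2 = rng.integers(0, d, v_txt.size); s2 = rng.choice([-1, 1], v_txt.size)
    y1 = count_sketch(v_img, h1, s1, d)
    y2 = count_sketch(v_txt, h2, s2, d)
    return np.real(np.fft.ifft(np.fft.fft(y1) * np.fft.fft(y2)))
```

In the architecture the abstract describes, this fusion is applied twice: once to predict attention over spatial features and once to combine the attended visual representation with the question representation.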
Attentive Explanations: Justifying Decisions and Pointing to the Evidence (Extended Abstract)
Deep models are the de facto standard in visual decision problems due to their
impressive performance on a wide array of visual tasks. On the other hand,
their opaqueness has led to a surge of interest in explainable systems. In this
work, we emphasize the importance of model explanation in various forms such as
visual pointing and textual justification. The lack of data with justification
annotations is one of the bottlenecks of generating multimodal explanations.
Thus, we propose two large-scale datasets with annotations that visually and
textually justify a classification decision, for activity recognition (ACT-X)
and for visual question answering (VQA-X). We also introduce a multimodal
methodology for generating visual and textual explanations simultaneously. We
quantitatively show that training with the textual explanations not only yields
better textual justification models, but also models that better localize the
evidence that supports their decision.
Multimodal Explanations: Justifying Decisions and Pointing to the Evidence
Deep models that are both effective and explainable are desirable in many
settings; prior explainable models have been unimodal, offering either
image-based visualization of attention weights or text-based generation of
post-hoc justifications. We propose a multimodal approach to explanation, and
argue that the two modalities provide complementary explanatory strengths. We
collect two new datasets to define and evaluate this task, and propose a novel
model which can provide joint textual rationale generation and attention
visualization. Our datasets define visual and textual justifications of a
classification decision for activity recognition tasks (ACT-X) and for visual
question answering tasks (VQA-X). We quantitatively show that training with the
textual explanations not only yields better textual justification models, but
also better localizes the evidence that supports the decision. We also
qualitatively show cases where visual explanation is more insightful than
textual explanation, and vice versa, supporting our thesis that multimodal
explanation models offer significant benefits over unimodal approaches.
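As a rough illustration of the multitask idea, not the authors' actual architecture, the sketch below shares a single spatial attention map between an answer classifier and a textual-explanation decoder, so the explanation loss also shapes where the model attends; all layer sizes, names, and the LSTM decoder are hypothetical.

```python
import torch
import torch.nn as nn

class MultimodalExplainer(nn.Module):
    # Hypothetical sketch: one attention distribution feeds both the
    # answer head and the textual-justification decoder.
    def __init__(self, d_img=512, d_q=512, d_h=512, n_ans=1000, vocab=10000):
        super().__init__()
        self.att = nn.Linear(d_img + d_q, 1)        # spatial attention scores
        self.ans_head = nn.Linear(d_img + d_q, n_ans)
        self.ctx_proj = nn.Linear(d_img + d_q, d_h)
        self.embed = nn.Embedding(vocab, d_h)
        self.decoder = nn.LSTM(d_h, d_h, batch_first=True)
        self.word_head = nn.Linear(d_h, vocab)

    def forward(self, img_feats, q_vec, expl_tokens):
        # img_feats: (B, R, d_img) regional features; q_vec: (B, d_q)
        B, R, _ = img_feats.shape
        q_tiled = q_vec.unsqueeze(1).expand(B, R, q_vec.size(-1))
        scores = self.att(torch.cat([img_feats, q_tiled], -1)).squeeze(-1)
        alpha = torch.softmax(scores, dim=1)        # visual explanation
        v_att = (alpha.unsqueeze(-1) * img_feats).sum(1)
        joint = torch.cat([v_att, q_vec], -1)
        ans_logits = self.ans_head(joint)
        # Condition the textual justification on the same joint context.
        h0 = torch.tanh(self.ctx_proj(joint)).unsqueeze(0)
        out, _ = self.decoder(self.embed(expl_tokens),
                              (h0, torch.zeros_like(h0)))
        return ans_logits, self.word_head(out), alpha
```

Training would minimize a cross-entropy answer loss plus a cross-entropy word loss over the explanation tokens; the abstract's quantitative claim is that adding the textual loss also improves how well the attention map alpha localizes the evidence.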
Vision and Language Understanding Through Generative Modeling
Language is a powerful representation for capturing knowledge and information about our world. It excels at expressing discrete concepts, such as objects and their attributes and the relationships between them, in a very compact manner, owing to its extremely high level of abstraction. Language is the primary means by which we communicate, comprehend, and express our thoughts and ideas, and it lies at the very core of human intelligence. With the advent of powerful generative models, machines have also begun to comprehend and generate natural language with notable fluency and creativity. However, they lack “grounding”, a direct tie to the visual world. Vision plays a pivotal role in our comprehension and production of language. When we describe a scene, understand instructions, or engage in a dialogue, visual context significantly aids our interpretation and generation of language. This highlights the need to integrate vision into generative modeling.
Chapters 1 and 2 delve into the image-to-text domain, spotlighting the importance of a multimodal approach to text generation. In Chapter 1, we explore how generating textual rationales with attention visualizations can enhance model transparency for visual question answering. In Chapter 2, we build generative models that abandon traditional left-to-right sequencing in favor of an unsupervised technique for determining optimal generation orders. Chapters 3 and 4 shift the focus to text-to-image generation. In Chapter 3, we introduce a training-free framework that combines linguistic cues with reference images, allowing for controllable image synthesis using denoising diffusion probabilistic models. Lastly, Chapter 4 emphasizes the importance of preserving object shapes in text-based image editing, proposing a unique mechanism that augments text-to-image models to be more faithful to input masks and text prompts.
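For readers unfamiliar with the generative primitive behind Chapters 3 and 4, here is a minimal sketch of one DDPM reverse (denoising) step; training-free editing frameworks typically intervene in this loop (for example, by blending in reference-image latents or applying masks), though the exact mechanism shown here is an assumption, not the dissertation's method.

```python
import numpy as np

def ddpm_reverse_step(x_t, t, eps_model, betas):
    # One ancestral sampling step of a denoising diffusion probabilistic
    # model (variance set to beta_t); eps_model predicts the added noise.
    beta_t = betas[t]
    alpha_bar_t = np.prod(1.0 - betas[: t + 1])
    eps = eps_model(x_t, t)
    mean = (x_t - beta_t / np.sqrt(1.0 - alpha_bar_t) * eps) \
           / np.sqrt(1.0 - beta_t)
    if t == 0:
        return mean                      # no noise added at the final step
    return mean + np.sqrt(beta_t) * np.random.standard_normal(x_t.shape)
```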
Statistical Analysis of Low-latitude Pi2 Pulsations Observed at Bohyun Station in Korea
We statistically investigated the properties of low-latitude Pi2 pulsations using Bohyun (BOH, Mlat = 29.8°, L = 1.35) ground magnetometer data from 2008. For this 1-year interval, 582 Pi2 events were identified while BOH was on the nightside, between 1800 and 0600 local time. We found the following Pi2 characteristics. (1) The occurrence distribution of Pi2s is relatively constant across local time. (2) Pi2 frequency varies with local time: pulsations in the postmidnight sector have higher frequencies than those in the premidnight sector. (3) Pi2 power in the premidnight sector is stronger than in the postmidnight sector. (4) Pi2 frequency is positively correlated with solar wind speed and the AE index. (5) Pi2 power shows no clear correlation with solar wind parameters, indicating that Pi2 power is not controlled by external sources. (6) The most probable time between Pi2 onsets is Δt ~ 37.5 min, interpreted as the period between Pi2 pulsations when they occur cyclically. We suggest that Δt ~ 37.5 min reflects the recurrence period of reconnection of open field lines in the tail lobe.
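As a sketch of the waiting-time statistic behind point (6), the snippet below computes the distribution of intervals between consecutive Pi2 onsets and reports its most probable value; the function name, input format, and 5-minute binning are assumptions, not the paper's procedure.

```python
import numpy as np

def most_probable_onset_interval(onset_times_hr, bin_width_min=5.0):
    # Intervals between consecutive onsets, converted to minutes.
    dt_min = np.diff(np.sort(onset_times_hr)) * 60.0
    bins = np.arange(0.0, dt_min.max() + bin_width_min, bin_width_min)
    counts, edges = np.histogram(dt_min, bins=bins)
    k = np.argmax(counts)
    return 0.5 * (edges[k] + edges[k + 1])   # bin-centre mode, in minutes
```

Applied to the 582 events analyzed here, a statistic of this kind yields the reported most probable interval of Δt ~ 37.5 min.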